Open In Colab

Diabetes Prediction Using Machine Learning¶

Overview¶

This project focuses on building a machine learning model to predict the likelihood of an individual being diabetic, pre-diabetic, or healthy. By analyzing healthcare statistics and lifestyle factors, the project aims to assist in early detection and intervention, enabling better diabetes management and prevention strategies.

Project Goals¶

  • Understand the relationship between healthcare and lifestyle statistics and diabetes risk.
  • Build a reliable classification model using advanced machine learning techniques.
  • Provide actionable insights through feature analysis and evaluation metrics.

Features¶

  • Data Preprocessing: Handling missing values, outliers, class imbalances, and encoding categorical variables.
  • Feature Selection: Identifying key factors influencing diabetes risk using correlation analysis and feature importance algorithms.
  • Model Development: Implementing and evaluating various machine learning models (e.g., Logistic Regression, Random Forest, Gradient Boosting, SVM).
  • Evaluation Metrics: Assessing models using precision, recall, F1-score, accuracy, and AUC for robust validation.
  • Presentation & Reporting: Summarizing the results, insights, and recommendations in an accessible format.

Methodology¶

  1. Data Preparation:
  • Collect and preprocess healthcare and lifestyle data.
  • Resolve discrepancies such as missing values, outliers, and imbalances.
  2. Feature Selection & Model Building:
  • Identify significant predictors of diabetes.
  • Compare machine learning algorithms to finalize the best-performing model.
  3. Model Evaluation:
  • Validate the model using multiple performance metrics.
  • Ensure robustness through cross-validation techniques.
  4. Documentation & Deployment:
  • Prepare detailed documentation and presentations.
  • Finalize the project for real-world applications.
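The steps above can be sketched as a single scikit-learn pipeline. This is a hypothetical outline under assumed defaults (synthetic data, mutual information for selection, `k=8`), not the project's actual code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the healthcare/lifestyle features
X, y = make_classification(n_samples=520, n_features=16, random_state=42)

# Steps 1-2: preprocess, select features, and fit a candidate model in one pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),                         # data preparation
    ("select", SelectKBest(mutual_info_classif, k=8)),   # feature selection
    ("model", RandomForestClassifier(random_state=42)),  # model building
])

# Step 3: validate with cross-validation before documenting/deploying
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```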

Technologies Used¶

  • Programming Language: Python
  • Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, XGBoost
  • Tools: Jupyter Notebook, GitHub

Expected Outcomes¶

  • A machine learning model that accurately predicts diabetes risk.
  • Insights into the impact of lifestyle factors on diabetes.
  • A comprehensive framework for healthcare professionals to support early diagnosis and preventative care.

Importing Libraries¶

  • Pandas : Data manipulation and analysis.
  • Matplotlib : Basic data visualization.
  • Seaborn : Statistical data visualization.
  • Scikit-learn : Machine learning and preprocessing.
  • Plotly Express : Interactive data visualization.
In [4]:
#importing the packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Diabetes = pd.read_csv('diabetesInfosys.csv') # loading the dataset
Diabetes.head(10) # Displays top 10 records of the dataset
Out[4]:
Age Gender Polyuria Polydipsia sudden weight loss weakness Polyphagia Genital thrush visual blurring Itching Irritability delayed healing partial paresis muscle stiffness Alopecia Obesity class
0 40 Male No Yes No Yes No No No Yes No Yes No Yes Yes Yes Positive
1 58 Male No No No Yes No No Yes No No No Yes No Yes No Positive
2 41 Male Yes No No Yes Yes No No Yes No Yes No Yes Yes No Positive
3 45 Male No No Yes Yes Yes Yes No Yes No Yes No No No No Positive
4 60 Male Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes Yes Positive
5 55 Male Yes Yes No Yes Yes No Yes Yes No Yes No Yes Yes Yes Positive
6 57 Male Yes Yes No Yes Yes Yes No No No Yes Yes No No No Positive
7 66 Male Yes Yes Yes Yes No No Yes Yes Yes No Yes Yes No No Positive
8 67 Male Yes Yes No Yes Yes Yes No Yes Yes No Yes Yes No Yes Positive
9 70 Male No Yes Yes Yes Yes No Yes Yes Yes No No No Yes No Positive

Preparing the Dataset¶

  • Checking for missing/null values.

  • Examining the information in the columns.

  • Reviewing summary statistics of the numeric column (Age).

In [6]:
Diabetes.isnull().sum()
Out[6]:
Age                   0
Gender                0
Polyuria              0
Polydipsia            0
sudden weight loss    0
weakness              0
Polyphagia            0
Genital thrush        0
visual blurring       0
Itching               0
Irritability          0
delayed healing       0
partial paresis       0
muscle stiffness      0
Alopecia              0
Obesity               0
class                 0
dtype: int64
In [7]:
Diabetes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Age                 520 non-null    int64 
 1   Gender              520 non-null    object
 2   Polyuria            520 non-null    object
 3   Polydipsia          520 non-null    object
 4   sudden weight loss  520 non-null    object
 5   weakness            520 non-null    object
 6   Polyphagia          520 non-null    object
 7   Genital thrush      520 non-null    object
 8   visual blurring     520 non-null    object
 9   Itching             520 non-null    object
 10  Irritability        520 non-null    object
 11  delayed healing     520 non-null    object
 12  partial paresis     520 non-null    object
 13  muscle stiffness    520 non-null    object
 14  Alopecia            520 non-null    object
 15  Obesity             520 non-null    object
 16  class               520 non-null    object
dtypes: int64(1), object(16)
memory usage: 69.2+ KB
In [8]:
Diabetes.describe()
Out[8]:
Age
count 520.000000
mean 48.028846
std 12.151466
min 16.000000
25% 39.000000
50% 47.500000
75% 57.000000
max 90.000000

EDA¶

This Exploratory Data Analysis (EDA) step focuses on preparing data for modeling by addressing:

  • Duplicates : Eliminate duplicates to maintain data uniqueness.

  • Missing Values : Identify and impute or remove based on feature relevance.

  • Outliers : Detect and manage with Z-score or IQR to avoid model bias.

  • Data Consistency : Standardize data types for reliable model compatibility.

This EDA phase ensures data quality and readiness for accurate modeling.
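As a minimal, self-contained sketch of the duplicate and IQR-outlier checks described above (using a small hypothetical frame rather than the actual dataset):

```python
import pandas as pd

# Hypothetical frame with one duplicate row and one implausible age
df = pd.DataFrame({"Age": [40, 58, 41, 45, 60, 40, 250]})

# Duplicates: drop exact duplicate rows to keep records unique
df = df.drop_duplicates()

# Outliers: flag values outside the IQR fences (Tukey's rule)
q1, q3 = df["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["Age"] < lower) | (df["Age"] > upper)]
print(outliers)
```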

In [11]:
import matplotlib.pyplot as plt

# Count the occurrences of each class (positive/negative)
class_counts = Diabetes['class'].value_counts()

# Custom colors for the pie chart
colors = ['#1f77b4', '#ff7f0e']  # Blue and Orange

# Create the pie chart
plt.figure(figsize=(6, 6))
plt.pie(class_counts, labels=class_counts.index, autopct='%1.1f%%', startangle=140, colors=colors)
plt.title("Ratio of Positive and Negative Cases")
plt.show()
In [12]:
# For creating interactive graphs
gendis= px.histogram(Diabetes, x = 'Gender', color = 'class', title="Distribution of Positive vs. Negative Diabetes Cases by Gender")
gendis.show()
pltbl= ['Gender', 'class']
cm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[pltbl[0]],Diabetes[pltbl[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = cm)
Out[12]:
class Negative Positive
Gender    
Female 9.500000 54.060000
Male 90.500000 45.940000

Because this table is normalized by class, it shows that females account for 54.06% of positive cases but only 9.5% of negative cases, while males dominate the negatives (90.5%). Female patients are therefore disproportionately represented among positive cases in this dataset.
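Note that `normalize='columns'` makes each class column sum to 100%, so the table answers "what fraction of positive cases are female?", not "what fraction of females test positive?". To read off positivity rates within each gender, `normalize='index'` can be used instead; a sketch with hypothetical miniature data:

```python
import pandas as pd

# Hypothetical miniature standing in for the Diabetes frame
mini = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male", "Female"],
    "class":  ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"],
})

# normalize='index' makes each row sum to 100%: P(class | Gender)
rates = round(pd.crosstab(mini["Gender"], mini["class"], normalize="index") * 100, 2)
print(rates)
```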

In [14]:
polyuria=px.histogram(Diabetes, x = 'Polyuria', color = 'class', title="Polyuria Frequency by Diabetes Status",
                       labels={"Polyuria": "Polyuria (Frequent Urination)", "count": "Number of Cases", "class": "Diabetes Status"})
polyuria.show()

plttbl_polyuria= ['Polyuria', 'class']
cm = sns.light_palette("green", as_cmap=True)

(round(pd.crosstab(Diabetes[plttbl_polyuria[0]], Diabetes[plttbl_polyuria[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = cm)
Out[14]:
class Negative Positive
Polyuria    
No 92.500000 24.060000
Yes 7.500000 75.940000

Among diabetes-positive patients, 75.94% report polyuria (frequent urination), while 92.5% of negative patients do not. Polyuria is therefore one of the strongest markers separating the two classes in this dataset.

In [16]:
polydispia = px.histogram(Diabetes, x = 'Polydipsia', color = 'class', title="Frequency of Increased Water Consumption (Polydipsia) by Diabetes Status",
    labels={"Polydipsia": "Polydipsia (Increased Water Consumption)", "count": "Number of Cases", "class": "Diabetes Status"})
polydispia.show()

plttblpolydispia= ['Polydipsia', 'class']
rm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plttblpolydispia[0]], Diabetes[plttblpolydispia[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = rm)
Out[16]:
class Negative Positive
Polydipsia    
No 96.000000 29.690000
Yes 4.000000 70.310000

Among positive patients, 70.31% report polydipsia (excessive thirst), while 96% of negative patients do not, making it another strong discriminator between the two classes.

In [18]:
swl = px.histogram(Diabetes, x = 'sudden weight loss', color = 'class', title="Distribution of Sudden Weight Loss by Diabetes Status",
    labels={"sudden weight loss": "Sudden Weight Loss", "count": "Number of Cases", "class": "Diabetes Status"})
swl.show()

plttblswl= ['sudden weight loss', 'class']
qm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plttblswl[0]], Diabetes[plttblswl[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = qm)
Out[18]:
class Negative Positive
sudden weight loss    
No 85.500000 41.250000
Yes 14.500000 58.750000

Sudden weight loss appears in 58.75% of positive cases but only 14.5% of negative cases. It is an important indicator, though less discriminating than polyuria or polydipsia, and other common illnesses can also cause weight loss, so it is not a definitive sign of diabetes on its own.

In [20]:
swl = px.histogram(Diabetes, x = 'weakness', color = 'class', title="Distribution of Weakness by Diabetes Status",
    labels={"weakness": "Weakness", "count": "Number of Cases", "class": "Diabetes Status"})
swl.show()
wkns = ['weakness', 'class']
sm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[wkns [0]],Diabetes[wkns [1]], normalize='columns') * 100,2)).style.background_gradient(cmap = sm)
Out[20]:
class Negative Positive
weakness    
No 56.500000 31.870000
Yes 43.500000 68.120000

Weakness appears in 68.12% of positive cases versus 43.5% of negative cases, making it a moderately useful indicator.

In [22]:
eating = px.histogram(Diabetes, x = 'Polyphagia', color = 'class', title="Distribution of Polyphagia (Excessive Eating) by Diabetes Status",

    labels={"Polyphagia": "Polyphagia (Excessive Eating)", "count": "Number of Cases", "class": "Diabetes Status"})
eating.show()

plt_eating= ['Polyphagia', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_eating[0]], Diabetes[plt_eating[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
Out[22]:
class Negative Positive
Polyphagia    
No 76.000000 40.940000
Yes 24.000000 59.060000

Polyphagia (excessive hunger) appears in 59.06% of positive cases but only 24% of negative cases, a moderate difference between the two classes.

In [24]:
gntlthrsh = px.histogram(Diabetes, x = 'Genital thrush',color='class',title="Genital Thrush Distribution by Diabetes Status",

    labels={"Genital thrush": "Genital Thrush", "count": "Number of Cases", "class": "Diabetes Status"})
gntlthrsh.show()

plt_thrsh= ['Genital thrush', 'class']
um = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_thrsh[0]], Diabetes[plt_thrsh[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = um)
Out[24]:
class Negative Positive
Genital thrush    
No 83.500000 74.060000
Yes 16.500000 25.940000

Genital thrush appears in 25.94% of positive cases versus 16.5% of negative cases, a modest difference with limited discriminating value.

In [26]:
visual = px.histogram(Diabetes, x = 'visual blurring', color = 'class',  title="Visual Blurring Distribution by Diabetes Status",

    labels={"visual blurring": "Visual Blurring", "count": "Number of Cases", "class": "Diabetes Status"})
visual.show()

plt_blurring= ['visual blurring', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_blurring[0]], Diabetes[plt_blurring[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
Out[26]:
class Negative Positive
visual blurring    
No 71.000000 45.310000
Yes 29.000000 54.690000

Visual blurring appears in 54.69% of positive cases versus 29% of negative cases, so it is noticeably more common among diabetes-positive patients.

In [28]:
creeping = px.histogram(Diabetes, x = 'Itching', color = 'class', title="Distribution of Itching (Creeping) Symptom by Diabetes Status",

    labels={"Itching": "Itching (Creeping)", "count": "Number of Cases", "class": "Diabetes Status"})
creeping.show()

plt_creeping= ['Itching', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_creeping[0]], Diabetes[plt_creeping[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
Out[28]:
class Negative Positive
Itching    
No 50.500000 51.880000
Yes 49.500000 48.120000

Itching appears in 48.12% of positive cases and 49.5% of negative cases. The near-identical rates show that itching has minimal value for separating the two classes.

In [30]:
irritiability = px.histogram(Diabetes, x = 'Irritability', color = 'class', title="Distribution of Irritability Symptom by Diabetes Status",

    labels={"Irritability": "Irritability", "count": "Number of Cases", "class": "Diabetes Status"})
irritiability.show()

plt_irritiability= ['Irritability', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_irritiability[0]], Diabetes[plt_irritiability[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
Out[30]:
class Negative Positive
Irritability    
No 92.000000 65.620000
Yes 8.000000 34.380000

Irritability appears in 34.38% of positive cases but only 8% of negative cases. Although it is a minority symptom overall, it is markedly more common among diabetes-positive patients.

In [32]:
dh = px.histogram(Diabetes, x = 'delayed healing', color = 'class', title="Delayed Healing by Diabetes Status")
dh.show()

plt_dh= ['delayed healing', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_dh[0]], Diabetes[plt_dh[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
Out[32]:
class Negative Positive
delayed healing    
No 57.000000 52.190000
Yes 43.000000 47.810000

Delayed healing appears in 47.81% of positive cases versus 43% of negative cases, a small difference with limited discriminating value.

In [34]:
paresis = px.histogram(Diabetes, x = 'partial paresis', color = 'class', title="Partial Paresis by Diabetes Status")
paresis.show()

plt_paresis= ['partial paresis', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_paresis[0]], Diabetes[plt_paresis[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
Out[34]:
class Negative Positive
partial paresis    
No 84.000000 40.000000
Yes 16.000000 60.000000

Partial paresis appears in 60% of positive cases versus only 16% of negative cases, making it one of the stronger indicators in the dataset.

In [36]:
muscle_stiffness = px.histogram(Diabetes, x = 'muscle stiffness', color = 'class', title="Muscle Stiffness by Diabetes Status")
muscle_stiffness.show()

plt_stiffness= ['muscle stiffness', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_stiffness[0]], Diabetes[plt_stiffness[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
Out[36]:
class Negative Positive
muscle stiffness    
No 70.000000 57.810000
Yes 30.000000 42.190000

Muscle stiffness appears in 42.19% of positive cases versus 30% of negative cases, so it is somewhat more common among diabetes-positive patients, though the difference is modest.

In [38]:
Hair_loss = px.histogram(Diabetes, x = 'Alopecia', color = 'class', title="Hair Loss")
Hair_loss.show()

plt_Hair_loss= ['Alopecia', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_Hair_loss[0]], Diabetes[plt_Hair_loss[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
Out[38]:
class Negative Positive
Alopecia    
No 49.500000 75.620000
Yes 50.500000 24.380000

Alopecia appears in only 24.38% of positive cases versus 50.5% of negative cases, so it is negatively associated with diabetes in this dataset and is not a meaningful indicator of diabetes risk on its own.

In [40]:
Obesity = px.histogram(Diabetes, x = 'Obesity', color = 'class', title="Obesity by Diabetes Status")
Obesity.show()

plt_body_fat= ['Obesity', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_body_fat[0]], Diabetes[plt_body_fat[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
Out[40]:
class Negative Positive
Obesity    
No 86.500000 80.940000
Yes 13.500000 19.060000

Obesity appears in 19.06% of positive cases versus 13.5% of negative cases. Contrary to expectation, obesity shows only a weak positive association with diabetes in this dataset.

Label Encoding¶

In [43]:
from sklearn import preprocessing
from sklearn import model_selection
number = preprocessing.LabelEncoder()
dtacpy1 = Diabetes.copy()   # Duplicating the Dataset
dtacpy1.head(5)
Out[43]:
Age Gender Polyuria Polydipsia sudden weight loss weakness Polyphagia Genital thrush visual blurring Itching Irritability delayed healing partial paresis muscle stiffness Alopecia Obesity class
0 40 Male No Yes No Yes No No No Yes No Yes No Yes Yes Yes Positive
1 58 Male No No No Yes No No Yes No No No Yes No Yes No Positive
2 41 Male Yes No No Yes Yes No No Yes No Yes No Yes Yes No Positive
3 45 Male No No Yes Yes Yes Yes No Yes No Yes No No No No Positive
4 60 Male Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes Yes Positive
In [44]:
for i in dtacpy1:
    # Note: this loop also label-encodes Age, mapping each distinct age to a rank code (e.g., 40 -> 16)
    dtacpy1[i] = number.fit_transform(dtacpy1[i])
dtacpy1.head()
Out[44]:
Age Gender Polyuria Polydipsia sudden weight loss weakness Polyphagia Genital thrush visual blurring Itching Irritability delayed healing partial paresis muscle stiffness Alopecia Obesity class
0 16 1 0 1 0 1 0 0 0 1 0 1 0 1 1 1 1
1 34 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1
2 17 1 1 0 0 1 1 0 0 1 0 1 0 1 1 0 1
3 21 1 0 0 1 1 1 1 0 1 0 1 0 0 0 0 1
4 36 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
In [45]:
X = dtacpy1.drop(['class'],axis=1) # Independent
y= dtacpy1['class'] # Dependent
X.head()
Out[45]:
Age Gender Polyuria Polydipsia sudden weight loss weakness Polyphagia Genital thrush visual blurring Itching Irritability delayed healing partial paresis muscle stiffness Alopecia Obesity
0 16 1 0 1 0 1 0 0 0 1 0 1 0 1 1 1
1 34 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0
2 17 1 1 0 0 1 1 0 0 1 0 1 0 1 1 0
3 21 1 0 0 1 1 1 1 0 1 0 1 0 0 0 0
4 36 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
In [46]:
y.head()
Out[46]:
0    1
1    1
2    1
3    1
4    1
Name: class, dtype: int32
In [47]:
# Calculate the correlation of each feature with the target variable
correlation = X.corrwith(y)

# Print the correlation values for reference
print("Feature Correlations with Target Variable:\n", correlation)

# Enhanced Bar Plot for Correlation with custom color
plt.figure(figsize=(15, 5))
correlation.plot(
    kind="bar",
    color="coral",  # Change bar color to coral
    edgecolor="darkred",
    linewidth=1,
    title="Feature Correlation with Target Variable (Class)"
)

# Add grid and adjust plot aesthetics
plt.title("Correlation of Features with Target Variable", fontsize=16, fontweight='bold')
plt.xlabel("Features", fontsize=12)
plt.ylabel("Correlation Coefficient", fontsize=12)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.xticks(rotation=45, ha="right")
plt.tight_layout()

# Display the plot
plt.show()
Feature Correlations with Target Variable:
 Age                   0.106419
Gender               -0.449233
Polyuria              0.665922
Polydipsia            0.648734
sudden weight loss    0.436568
weakness              0.243275
Polyphagia            0.342504
Genital thrush        0.110288
visual blurring       0.251300
Itching              -0.013384
Irritability          0.299467
delayed healing       0.046980
partial paresis       0.432288
muscle stiffness      0.122474
Alopecia             -0.267512
Obesity               0.072173
dtype: float64
  • From the graph above, we can identify a strong correlation between the variable "Class" (indicating diabetes presence) and specific factors, listed in order of strongest positive relationship:

    • Polyuria (frequent urination)
    • Polydipsia (increased thirst)
    • Sudden weight loss
    • Partial paresis (muscle weakness)
  • These factors are positively correlated with the likelihood of diabetes, meaning patients showing these symptoms are more likely to be diagnosed as diabetic. This insight is key for identifying individuals at higher risk based on common symptoms.

  • On the other hand, variables that show a negative correlation—such as Alopecia (hair loss)—appear much less significant. A negative correlation with "Class" suggests that if a patient tests positive for alopecia alone, they are unlikely to be diabetic. Thus, alopecia is not a meaningful indicator of diabetes risk in isolation.

In [49]:
symptoms = ["Polyuria", "Polydipsia", "sudden weight loss", "weakness", "Polyphagia",
            "Genital thrush", "visual blurring", "Itching", "Irritability",
            "delayed healing", "partial paresis", "muscle stiffness", "Alopecia", "Obesity"]

df_binary = pd.get_dummies(Diabetes[symptoms], drop_first=True)
df_binary['Target'] = Diabetes['class'].apply(lambda x: 1 if x == "Positive" else 0)

# Calculate pairwise correlations
corr_matrix_binary = df_binary.corr()

# Plotting heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix_binary, cmap="PiYG", annot=True, linewidths=0.5, center=0)

plt.title("Pairwise Correlation Heatmap for  Features and Target", fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

The pairwise correlation heatmap for binary features provides the following insights about the relationships between symptoms and diabetes:

  • Direct Symptom-Diabetes Correlation :

    • The correlation values in the "Target" row show how strongly each symptom is associated with a diabetes diagnosis (positive correlation) or with the absence of diabetes (negative correlation).
    • Positive Correlations (values closer to +1): Symptoms with higher positive correlations are more commonly present in individuals diagnosed with diabetes. For instance, if symptoms like Polyuria or Polydipsia have high positive correlations, this indicates these symptoms are strong indicators of diabetes.
    • Negative Correlations (values closer to -1): Symptoms with negative correlations may be more frequent in individuals without diabetes. For instance, if Alopecia shows a negative correlation, it could indicate that individuals with alopecia are less likely to be diagnosed with diabetes.
  • Inter-Symptom Relationships : Symptoms with high correlations to each other may indicate a tendency to co-occur. For example, if Polyuria and Polydipsia show a strong correlation with each other, it suggests these symptoms often appear together in diabetic patients, possibly due to similar physiological effects.

  • Weak or Neutral Correlations : Features with correlation values near zero with the target variable may not contribute much to diabetes prediction and could be less useful in diagnostic contexts. These features might represent common symptoms that don't have a strong association with diabetes specifically, such as symptoms more related to other health issues.

  • Potential Predictive Indicators : The symptoms with the strongest positive or negative correlations with the target variable are the most useful for diagnosis and model prediction. Positive indicators (e.g., symptoms highly correlated with diabetes) could become focus points for early screening.
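One way to act on the last point is to keep only features whose absolute correlation with the target passes a threshold. A sketch with a hypothetical miniature of `df_binary` (the 0.3 cutoff is an illustrative choice, not one used in the notebook):

```python
import pandas as pd

# Hypothetical stand-in for df_binary: binary symptom columns plus Target
mini_binary = pd.DataFrame({
    "Polyuria_Yes": [1, 1, 1, 1, 0, 0, 0, 0],
    "Itching_Yes":  [1, 0, 1, 0, 1, 0, 1, 0],
    "Target":       [1, 1, 1, 1, 0, 0, 0, 0],
})

# Rank features by |correlation| with the target and keep the strong ones
corr_with_target = mini_binary.corr()["Target"].drop("Target")
strong = corr_with_target[corr_with_target.abs() >= 0.3].sort_values(ascending=False)
print(strong)
```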

In [51]:
# Enhanced box plot with all dataset features in tooltips
genbox = px.box(
    Diabetes,
    y="Age",
    x="class",
    color="Gender",
    points="all",
    title="Age Distribution by Diabetes Status, Gender, and Additional Symptoms",

    # Custom color mapping for gender
    color_discrete_map={"Male": "blue", "Female": "pink"},

    # Adding facets for additional segmentation (e.g., by "sudden weight loss")
    facet_row="Polyuria",  # Faceting by Polyuria (could change based on interest)
    facet_col="Polydipsia",  # Faceting by Polydipsia

    # Including all relevant attributes as hover data for insight
    hover_data={
        "Polyuria": True,
        "Polydipsia": True,
        "sudden weight loss": True,
        "weakness": True,
        "Polyphagia": True,
        "Genital thrush": True,
        "visual blurring": True,
        "Itching": True,
        "Irritability": True,
        "partial paresis": True,
        "Alopecia": True,
        "class": True
    }
)

# Show the enhanced plot
genbox.show()
  • The box plot shows that age and gender influence diabetes status, with younger females and older males showing distinct patterns.
  • Symptoms like frequent urination (Polyuria) and excessive thirst (Polydipsia) are commonly seen in diabetes-positive cases, while symptoms like hair loss (Alopecia) are less common among them.
  • This plot helps us identify typical diabetes symptoms and points to specific combinations of age, gender, and symptoms that may assist in early detection of diabetes.

Feature Selection¶

  • Feature selection is the process of identifying and selecting the most important features in a dataset. It aims to improve model performance by removing irrelevant or redundant features. This helps reduce overfitting, improve accuracy, and decrease computational cost.
In [54]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier

# Perform Chi-Square feature selection
chi2_selector = SelectKBest(chi2, k='all')
chi2_selector.fit(X, y)
chi2_scores = chi2_selector.scores_

# Perform Random Forest-based feature importance
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X, y)
rf_importances = rf_model.feature_importances_

# Combine feature importance metrics
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Chi2 Score': chi2_scores,
    'RF Importance': rf_importances
})

# Sort features by Random Forest Importance (descending)
feature_importance_sorted = feature_importance.sort_values(by='RF Importance', ascending=False)

# Display sorted feature importance
print("Feature Importance (Ordered by Random Forest Importance):")
print("\n")
print(feature_importance_sorted)


import matplotlib.pyplot as plt

# Sort features and scores by Chi-Square scores (descending)
chi2_sorted = feature_importance.sort_values(by='Chi2 Score', ascending=False)
chi2_features = chi2_sorted['Feature']
chi2_scores_sorted = chi2_sorted['Chi2 Score']

# Sort features and scores by Random Forest importance (descending)
rf_sorted = feature_importance.sort_values(by='RF Importance', ascending=False)
rf_features = rf_sorted['Feature']
rf_importances_sorted = rf_sorted['RF Importance']

# Plot Chi-Square Scores and Random Forest Importances
plt.figure(figsize=(14, 8))

# Chi-Square Scores plot
plt.subplot(1, 2, 1)
plt.barh(chi2_features, chi2_scores_sorted, color='skyblue')
plt.title('Chi-Square Feature Importance')
plt.xlabel('Chi-Square Score')
plt.gca().invert_yaxis()  # Ensures highest priority is at the top

# Random Forest Importances plot
plt.subplot(1, 2, 2)
plt.barh(rf_features, rf_importances_sorted, color='lightcoral')
plt.title('Random Forest Feature Importance')
plt.xlabel('Feature Importance')
plt.gca().invert_yaxis()  # Ensures highest priority is at the top

plt.tight_layout()
plt.show()
Feature Importance (Ordered by Random Forest Importance):


               Feature  Chi2 Score  RF Importance
2             Polyuria  116.184593       0.203702
3           Polydipsia  120.785515       0.202658
1               Gender   38.747637       0.104876
0                  Age   33.971724       0.095522
12     partial paresis   55.314286       0.055078
4   sudden weight loss   57.749309       0.052379
14            Alopecia   24.402793       0.044073
10        Irritability   35.334127       0.039392
6           Polyphagia   33.198418       0.030942
8      visual blurring   18.124571       0.030147
9              Itching    0.047826       0.029924
11     delayed healing    0.620188       0.028387
13    muscle stiffness    4.875000       0.026561
7       Genital thrush    4.914009       0.021007
5             weakness   12.724262       0.018950
15             Obesity    2.250284       0.016403
  • Chi-Square (Chi2): Looks at each feature (like Gender or Age) by itself to see if it has a strong, direct link to the outcome (like diabetes). If a feature doesn’t stand out alone, it gets a low score.

  • Random Forest (RF): Looks at how features work together. Even if Gender or Age don’t seem very important alone, they might play a big role when combined with other features (like sudden weight loss or Polyuria) to make better predictions.

So, Chi2 checks individual importance, while RF focuses on teamwork among the features.

Why Random Forest?

  • If your goal is statistical analysis and you need a quick, simple check, Chi2 might suffice. But if you’re building a predictive model, Random Forest provides richer insights into how features influence outcomes, especially when features interact or relationships are complex.

  • By combining both methods, you strike a balance between efficiency (Chi2) and effectiveness (RF). This approach avoids unnecessary complexity while ensuring you keep features that significantly impact the model.
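A hedged sketch of combining the two: cross-validate a Random Forest on all features and on the top-k features by chi-square, then compare. Synthetic data stands in for the encoded diabetes frame, and `k=8` is an arbitrary illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the encoded diabetes features/target
X, y = make_classification(n_samples=520, n_features=16, n_informative=6, random_state=42)
X = X - X.min(axis=0)  # chi2 requires non-negative features

# Baseline: Random Forest on all 16 features
rf = RandomForestClassifier(random_state=42)
full_score = cross_val_score(rf, X, y, cv=5).mean()

# Variant: keep only the 8 features with the highest chi-square scores
selected = make_pipeline(SelectKBest(chi2, k=8), RandomForestClassifier(random_state=42))
selected_score = cross_val_score(selected, X, y, cv=5).mean()

print(f"All 16 features: {full_score:.3f}  Top-8 by chi2: {selected_score:.3f}")
```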

PCA Analysis and Feature Reduction for Diabetes Prediction Model¶

  • Dimensionality reduction refers to the process of reducing the number of input variables (features) in a dataset while retaining as much of the original information as possible.
In [58]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load the dataset
file_path = 'diabetesInfosys.csv'
pcadata = pd.read_csv(file_path)
data=pcadata


from sklearn.preprocessing import LabelEncoder

# Encode categorical features
encoder = LabelEncoder()
for col in data.columns:
    if data[col].dtype == 'object':
        data[col] = encoder.fit_transform(data[col])

# Separate features and target
X = data.drop(columns=['class'])
y = data['class']

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

variance_before_pca = X.var(axis=0)  # Variance of each feature in the original data
print(f"Variance before PCA (for each feature):\n{variance_before_pca}")

print("-------------------------------------------------------------------")

# Apply PCA to retain 95% of variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# Display explained variance
print(f"Explained variance by each component: {pca.explained_variance_ratio_}")
print(f"Total components selected: {pca.n_components_}")
print(f"Original shape: {X.shape}, Reduced shape: {X_pca.shape}")

# Plot cumulative explained variance
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), 
         np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.title('Cumulative Explained Variance by Principal Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()



# Identify the removed features based on the number of components retained
original_columns = X.columns  # If X is a pandas DataFrame
retained_features_count = pca.n_components_
Variance before PCA (for each feature):
Age                   147.658126
Gender                  0.233348
Polyuria                0.250467
Polydipsia              0.247780
sudden weight loss      0.243631
weakness                0.242978
Polyphagia              0.248522
Genital thrush          0.173648
visual blurring         0.247780
Itching                 0.250300
Irritability            0.183948
delayed healing         0.248848
partial paresis         0.245680
muscle stiffness        0.234827
Alopecia                0.226171
Obesity                 0.140863
dtype: float64
-------------------------------------------------------------------
Explained variance by each component: [0.24421092 0.13922824 0.09026398 0.0753711  0.0602457  0.05201242
 0.04808062 0.04611141 0.04151837 0.03601345 0.0329961  0.03145423
 0.03060287 0.02580352]
Total components selected: 14
Original shape: (520, 16), Reduced shape: (520, 14)
[Figure: cumulative explained variance by principal components]

Why PCA Is Unnecessary for This Dataset

  • With only 16 features, the dataset is already low-dimensional; applying PCA would sacrifice the interpretability of the original features for little gain.

Model Building¶

We evaluated six models:¶

  • Logistic Regression
  • Random Forest
  • Gradient Boosting
  • Support Vector Classifier (SVC)
  • Extra Trees
  • Decision Tree

Each model was evaluated on four metrics: Accuracy, Precision, Recall, and F1 Score.¶

  • Accuracy: How often the model is correct.
  • Precision: How many predicted positives are actually correct.
  • Recall: How many actual positives the model correctly identified.
  • F1 Score: The harmonic mean of precision and recall, balancing the two.
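As a toy illustration of the four metrics (using hypothetical labels, not the project's data):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions (1 = diabetic, 0 = healthy)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # 3 TP, 1 FN, 3 TN, 1 FP

print("Accuracy :", accuracy_score(y_true, y_pred))   # (3 + 3) / 8 = 0.75
print("Precision:", precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print("Recall   :", recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print("F1 Score :", f1_score(y_true, y_pred))         # harmonic mean = 0.75
```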
In [62]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import MinMaxScaler

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Scaling the data¶

  • Scaling puts all features on a comparable range, preventing any single feature from dominating the model simply because of its larger magnitude. It also improves convergence and ensures a fair contribution from each feature.
In [64]:
# Scale data for Logistic Regression and SVC
scaler = MinMaxScaler()
X_train_log_reg = scaler.fit_transform(X_train)
X_test_log_reg = scaler.transform(X_test)

Hyperparameter Tuning¶

  • Hyperparameter tuning is the process of selecting the best values for a model's hyperparameters.
  • For hyperparameter tuning, we used scikit-learn's GridSearchCV.
  • GridSearchCV exhaustively tries every combination in a parameter grid, scores each one with cross-validation, and keeps the combination that performs best.
In [66]:
# Define the hyperparameter grids for each model
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5]
}

param_grid_gb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

param_grid_lr = {
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'saga']
}

param_grid_dt = {
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

param_grid_svc = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

param_grid_et = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
# Initialize models
rf = RandomForestClassifier(random_state=42)
gb = GradientBoostingClassifier(random_state=42)
lr = LogisticRegression(random_state=42)
dt = DecisionTreeClassifier(random_state=42)
svc = SVC(random_state=42)
et = ExtraTreesClassifier(random_state=42)

# Initialize GridSearchCV for each model
grid_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_gb = GridSearchCV(estimator=gb, param_grid=param_grid_gb, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_lr = GridSearchCV(estimator=lr, param_grid=param_grid_lr, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_dt = GridSearchCV(estimator=dt, param_grid=param_grid_dt, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_svc = GridSearchCV(estimator=svc, param_grid=param_grid_svc, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_et = GridSearchCV(estimator=et, param_grid=param_grid_et, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)

# Fit the grid search to the data
grid_rf.fit(X_train, y_train)
grid_gb.fit(X_train, y_train)
grid_lr.fit(X_train_log_reg, y_train)  # Using scaled data for Logistic Regression
grid_dt.fit(X_train, y_train)
grid_svc.fit(X_train_log_reg, y_train)  # SVC needs scaled data
grid_et.fit(X_train, y_train)

# Print the best parameters for each model
print("Best parameters for Random Forest:", grid_rf.best_params_)
print("Best parameters for Gradient Boosting:", grid_gb.best_params_)
print("Best parameters for Logistic Regression:", grid_lr.best_params_)
print("Best parameters for Decision Tree:", grid_dt.best_params_)
print("Best parameters for SVC:", grid_svc.best_params_)
print("Best parameters for Extra Trees:", grid_et.best_params_)

# Best models obtained from GridSearchCV
best_rf = grid_rf.best_estimator_
best_gb = grid_gb.best_estimator_
best_lr = grid_lr.best_estimator_
best_dt = grid_dt.best_estimator_
best_svc = grid_svc.best_estimator_
best_et = grid_et.best_estimator_
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 18 candidates, totalling 54 fits
Fitting 3 folds for each of 6 candidates, totalling 18 fits
Fitting 3 folds for each of 24 candidates, totalling 72 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best parameters for Random Forest: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 100}
Best parameters for Gradient Boosting: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
Best parameters for Logistic Regression: {'C': 1, 'solver': 'liblinear'}
Best parameters for Decision Tree: {'criterion': 'entropy', 'max_depth': 10, 'min_samples_split': 2}
Best parameters for SVC: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Best parameters for Extra Trees: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}

Train and Evaluate Each Model¶

In [68]:
# Initialize a list to store results
results_list = []

# Train and evaluate each model
for name, model in {'Random Forest': best_rf, 'Gradient Boosting': best_gb, 'Logistic Regression': best_lr, 
                    'Decision Tree': best_dt, 'SVC': best_svc,'Extra Trees': best_et}.items():
    if name in ('Logistic Regression', 'SVC'):
        # Logistic Regression and SVC require scaled data
        model.fit(X_train_log_reg, y_train)
        y_pred = model.predict(X_test_log_reg)
    else:
        # Other models work with original (unscaled) data
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    # Evaluate performance
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    # Add results to the list
    results_list.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    })

Create a DataFrame with the Results¶

In [70]:
# Step 4: Create a DataFrame with the results
results_df = pd.DataFrame(results_list)

# Display the final results DataFrame
print(results_df)

print("-----------------------------------------------------------------------------------")
top_models = results_df.sort_values(by='Accuracy', ascending=False).head(2)

# Print the top two models with the highest accuracy
print("\nTop two models with the highest accuracy:")
print(top_models)
                 Model  Accuracy  Precision    Recall  F1 Score
0        Random Forest  0.990385   1.000000  0.985915  0.992908
1    Gradient Boosting  0.980769   1.000000  0.971831  0.985714
2  Logistic Regression  0.932692   0.944444  0.957746  0.951049
3        Decision Tree  0.980769   1.000000  0.971831  0.985714
4                  SVC  0.971154   0.985714  0.971831  0.978723
5          Extra Trees  0.990385   1.000000  0.985915  0.992908
-----------------------------------------------------------------------------------

Top two models with the highest accuracy:
           Model  Accuracy  Precision    Recall  F1 Score
0  Random Forest  0.990385        1.0  0.985915  0.992908
5    Extra Trees  0.990385        1.0  0.985915  0.992908
In [71]:
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='Accuracy', data=results_df, palette='viridis')

# Title and labels
plt.title('Accuracy by Model')
plt.ylabel('Accuracy')
plt.xticks(rotation=45, ha="right")

# Show the plot
plt.tight_layout()
plt.show()
[Figure: bar plot of accuracy by model]

  • Although Random Forest and Extra Trees achieve identical accuracy, precision, recall, and F1 scores, Random Forest is preferred.

Why Random Forest Over Extra Trees?

In our diabetes prediction model, we chose Random Forest over Extra Trees for three key reasons:

Better Interpretability:

  • Random Forest provides clear insights into the importance of features like Polyuria and Polydipsia, helping us understand which factors most influence diabetes prediction. For medical applications, transparency in model decision-making is crucial, as it helps healthcare providers trust the model’s results.

Less Randomness in Decisions:

  • Extra Trees introduces extra randomness by choosing split thresholds at random, which can reduce interpretability. In healthcare, stability and consistency matter more than squeezing out slight accuracy gains, especially when the model's decisions affect lives.

Stability:

  • Random Forest tends to be more stable across different data splits due to the way it combines predictions from multiple trees with bootstrapping.
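The interpretability point above rests on the forest's `feature_importances_` attribute. A minimal sketch, using a synthetic stand-in dataset (the notebook's `best_rf` and real feature names such as Polyuria and Polydipsia are not reproduced in this cell):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Synthetic stand-in shaped like the project's data: 520 rows, 16 features
X, y = make_classification(n_samples=520, n_features=16, n_informative=6,
                           random_state=42)
cols = [f'feature_{i}' for i in range(16)]  # hypothetical names

rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                            random_state=42).fit(X, y)

# Importances sum to 1; the top entries drive the model's predictions
importances = pd.Series(rf.feature_importances_, index=cols)
print(importances.sort_values(ascending=False).head(5))
```

On the real dataset the same five lines of inspection are what surface symptoms like Polyuria and Polydipsia as the dominant predictors.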

Conclusion :

Random Forest offers a balance of performance, interpretability, and stability. This makes it the preferred choice for understanding and explaining the factors influencing diabetes diagnosis, ensuring the model's decisions are trustworthy and reliable in real-world medical settings.
